pip install gdown
!pip install ipython-autotime
%load_ext autotime
pip install gensim
pip install pandas
pip install -U scikit-learn
pip install torch torchvision torchaudio
pip install tables
pip install matplotlib
# Do not execute if loading the embeddings from drive
! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
! gdown 'https://drive.google.com/uc?id=15qqiiENEWU6UuhKb4k1JDiPKsW4GJAaw'
--2021-12-04 23:17:38--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: 'GoogleNews-vectors-negative300.bin.gz'
Goog 38%[======>              ] 603.61M  78.0MB/s  eta 19s  ^C  (wget interrupted)
Downloading...
From: https://drive.google.com/uc?id=15qqiiENEWU6UuhKb4k1JDiPKsW4GJAaw
To: /home/studio-lab-user/project/glove.6B.300d.word2vec
14%|█████▎ | 143M/1.04G [00:02<00:10, 89.3MB/s] ^C  (gdown interrupted; KeyboardInterrupt traceback omitted)
! gdown 'https://drive.google.com/uc?id=1zTDcgnFtQWUeXgIce4O2Z2OPn0vJH1aD'
Downloading...
From: https://drive.google.com/uc?id=1zTDcgnFtQWUeXgIce4O2Z2OPn0vJH1aD
To: /home/studio-lab-user/project/utils_py.py
100%|██████████| 6.69k/6.69k [00:00<00:00, 9.43MB/s]
pip install transformers
pip install wordcloud
! gdown "https://drive.google.com/uc?id=1PSjnBVZM_hmM0jtQrZ7QNuzTeWDTPuHM"
! gdown "https://drive.google.com/uc?id=1hZmVoYIyrY_oIFYe_zA3oPFkttom8Kc1"
! gdown "https://drive.google.com/uc?id=1vKduM6oCoGWWiXZrn1P33rnu74YbJJ2T"
Downloading...
To: /home/studio-lab-user/project/bbc.csv (5.10M)
To: /home/studio-lab-user/project/classic3.csv (3.84M)
To: /home/studio-lab-user/project/classic4.csv (5.02M)
from gensim.models import KeyedVectors
import pandas as pd
import numpy as np
import warnings
import re
from utils_py import static_document_embeddings, tokenize_re
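utils_py is downloaded above but its source is not shown here, so the exact behaviour of `tokenize_re` is unknown. As a rough, hypothetical sketch of what a regex-based tokenizer like it might do (lowercase the text, keep alphabetic word tokens):

```python
import re

# Hypothetical sketch only -- utils_py is not shown, so this is a guess at
# what a tokenizer such as tokenize_re could look like: lowercase the text
# and extract runs of letters, dropping punctuation and digits.
def tokenize_re_sketch(text):
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize_re_sketch("The BBC dataset has 2,225 documents.")
# tokens == ['the', 'bbc', 'dataset', 'has', 'documents']
```

The actual `tokenize_re` in utils_py may normalize or filter differently.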
df = pd.read_csv("./classic4.csv")
df
texts = df['text'].values
classic4_labels = df["label"].values
k_classic4 = len(df['label'].unique())
print(k_classic4)
df.info()
4
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7095 entries, 0 to 7094
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  7095 non-null   int64
 1   text        7095 non-null   object
 2   label       7095 non-null   object
dtypes: int64(1), object(2)
memory usage: 166.4+ KB
df_bbc = pd.read_csv("./bbc.csv")
df_bbc
texts_bbc = df_bbc['text'].values
bbc_labels = df_bbc["label"].values
k_bbc = len(df_bbc['label'].unique())
print(k_bbc)
df_bbc.info()
5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  2225 non-null   int64
 1   text        2225 non-null   object
 2   label       2225 non-null   object
dtypes: int64(1), object(2)
memory usage: 52.3+ KB
Read the embeddings previously saved to Google Drive
! gdown "https://drive.google.com/uc?id=1270jrqfjC_j9jbujB8n6bTovnqCBG1Xp"
! gdown "https://drive.google.com/uc?id=1--FNn99G1Uut0R1v6mFqiuBi_pG-jC6o"
Downloading...
To: /home/studio-lab-user/project/classic4_word2vec.h5 (17.1M)
To: /home/studio-lab-user/project/bbc_word2vec.h5 (2.70M)
classic4_word2vec = pd.read_hdf("classic4_word2vec.h5").to_numpy()
bbc_word2vec = pd.read_hdf("bbc_word2vec.h5").to_numpy()
! gdown "https://drive.google.com/uc?id=1-04KQb6ykGW20yXI8dSZk5YJQKT2CRGh"
! gdown "https://drive.google.com/uc?id=1-0eXYefFHTliAuFfrRX8g2gnkYlF3Z5H"
Downloading...
To: /home/studio-lab-user/project/bbc_glove.h5 (2.70M)
To: /home/studio-lab-user/project/classic4_glove.h5 (17.1M)
classic4_glove = pd.read_hdf("classic4_glove.h5").to_numpy()
bbc_glove = pd.read_hdf("bbc_glove.h5").to_numpy()
word2vec_model = KeyedVectors.load_word2vec_format("./GoogleNews-vectors-negative300.bin.gz", binary=True)
# gensim 4.x removed KeyedVectors.vocab; membership is checked via key_to_index
get_word2vec_vector = lambda x: word2vec_model[x] if x in word2vec_model.key_to_index else None
classic4_word2vec = static_document_embeddings(get_word2vec_vector, texts, tokenize_re)
classic4_word2vec.shape
bbc_word2vec = static_document_embeddings(get_word2vec_vector, texts_bbc, tokenize_re)
bbc_word2vec.shape
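`static_document_embeddings` comes from utils_py, whose source is not shown; judging from how it is called, it presumably looks up a vector for each token and pools them into one fixed-size vector per document. A minimal sketch of that idea, using a toy 3-dimensional "vocabulary" in place of the real word2vec model (the function and toy data here are illustrative, not the utils_py implementation):

```python
import numpy as np

# Sketch of averaging static word vectors into a document embedding:
# look up each token, skip out-of-vocabulary tokens (lookup returns None,
# mirroring get_word2vec_vector above), and average the rest.
def average_embedding(get_vector, tokens, dim=300):
    vecs = [v for v in (get_vector(t) for t in tokens) if v is not None]
    if not vecs:
        return np.zeros(dim)  # document with no known words
    return np.mean(vecs, axis=0)

# Toy stand-in for the 300-d word2vec model.
toy = {"cat": np.array([1.0, 0.0, 0.0]), "dog": np.array([0.0, 1.0, 0.0])}
get_toy_vector = lambda w: toy.get(w)

doc_vec = average_embedding(get_toy_vector, ["cat", "dog", "xyzzy"], dim=3)
# "xyzzy" is OOV and skipped; doc_vec is the mean of "cat" and "dog": [0.5, 0.5, 0.0]
```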
from google.colab import drive
drive.mount('/content/drive')
KeyboardInterrupt: drive.mount('/content/drive') was interrupted at the authentication prompt (google.colab drive mounting only works inside Colab; full traceback omitted).
pd.DataFrame(classic4_word2vec).to_hdf('./drive/MyDrive/MLDS/CoClust/classic4_word2vec.h5', key='df', mode='w')
pd.DataFrame(bbc_word2vec).to_hdf('./drive/MyDrive/MLDS/CoClust/bbc_word2vec.h5', key='df', mode='w')
!wget -c "https://nlp.stanford.edu/data/glove.840B.300d.zip"
! unzip glove.840B.300d.zip
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
glove_file = 'glove.840B.300d.txt'
tmp_file = "./glove.840B.300d.word2vec"
_ = glove2word2vec(glove_file, tmp_file)
glove_model = KeyedVectors.load_word2vec_format(tmp_file, binary=False)
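The word2vec text format that `glove2word2vec` produces differs from the raw GloVe file only by a leading "vocab_size dimensions" header line, so the conversion amounts to counting lines and prepending that header. (In gensim 4 the script is deprecated; `KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)` reads the GloVe file directly.) A toy illustration of the header rule:

```python
# The word2vec text format = GloVe text format + one header line
# "<vocab_size> <dimensions>". This toy function shows the whole conversion.
def prepend_word2vec_header(glove_lines):
    dim = len(glove_lines[0].split()) - 1  # first field is the word itself
    return [f"{len(glove_lines)} {dim}"] + glove_lines

lines = ["cat 1.0 0.0 0.0", "dog 0.0 1.0 0.0"]
converted = prepend_word2vec_header(lines)
# converted[0] == "2 3"  (two words, three dimensions)
```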
!cp "./glove.840B.300d.word2vec" "./drive/MyDrive/MLDS/CoClust/glove.840B.300d.word2vec"
# glove_model = KeyedVectors.load_word2vec_format("./glove.6B.300d.word2vec", binary=False)
# gensim 4.x removed KeyedVectors.vocab; membership is checked via key_to_index
get_glove_vector = lambda x: glove_model[x] if x in glove_model.key_to_index else None
classic4_glove = static_document_embeddings(get_glove_vector, texts, tokenize_re)
classic4_glove.shape
bbc_glove = static_document_embeddings(get_glove_vector, texts_bbc, tokenize_re)
bbc_glove.shape
pd.DataFrame(classic4_glove).to_hdf('./drive/MyDrive/MLDS/CoClust/classic4_glove.h5', key='df', mode='w')
pd.DataFrame(bbc_glove).to_hdf('./drive/MyDrive/MLDS/CoClust/bbc_glove.h5', key='df', mode='w')
from utils_py import collate_fn, encode_sentences_batch
from transformers import BertModel, BertTokenizer, RobertaModel, RobertaTokenizer, AlbertTokenizer, AlbertModel
from torch.utils.data import DataLoader
import torch
if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'
model_name = 'bert-base-cased'
model = BertModel.from_pretrained(model_name, output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained(model_name)
if device == 'cuda':
    model.cuda()
classic4_bert = encode_sentences_batch(model, tokenizer, texts, batch_size=32)
# concatenated embeddings
classic4_bert_concat = np.concatenate(classic4_bert, axis=1)
classic4_bert_concat.shape
bbc_bert = encode_sentences_batch(model, tokenizer, texts_bbc, batch_size=32)
# concatenated embeddings
bbc_bert_concat = np.concatenate(bbc_bert, axis=1)
bbc_bert_concat.shape
pd.DataFrame(classic4_bert_concat).to_hdf('./drive/MyDrive/MLDS/CoClust/classic4_bert.h5', key='df', mode='w')
pd.DataFrame(bbc_bert_concat).to_hdf('./drive/MyDrive/MLDS/CoClust/bbc_bert.h5', key='df', mode='w')
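`encode_sentences_batch` is defined in utils_py and not shown here; from the way its result is passed to `np.concatenate(..., axis=1)`, it appears to return a list of `(n_docs, hidden_size)` arrays (for example, one pooled embedding per selected hidden layer, which `output_hidden_states=True` exposes). The concatenation step itself is just:

```python
import numpy as np

# Illustration of the concatenation step (shapes are made up): given a list
# of per-layer document embeddings, axis=1 joins them feature-wise into a
# single (n_docs, n_layers * hidden_size) matrix.
n_docs, hidden_size = 4, 8
per_layer = [np.random.rand(n_docs, hidden_size) for _ in range(3)]
features = np.concatenate(per_layer, axis=1)
# features.shape == (4, 24)
```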
model_name = 'roberta-base'
modelRoberta = RobertaModel.from_pretrained(model_name, output_hidden_states=True)
tokenizerRoberta = RobertaTokenizer.from_pretrained(model_name)
if device == 'cuda':
    modelRoberta.cuda()
classic4_roberta = encode_sentences_batch(modelRoberta, tokenizerRoberta, texts, batch_size=32)
# concatenated
classic4_roberta_concat = np.concatenate(classic4_roberta, axis=1)
classic4_roberta_concat.shape
bbc_roberta = encode_sentences_batch(modelRoberta, tokenizerRoberta, texts_bbc, batch_size=32)
# concatenated
bbc_roberta_concat = np.concatenate(bbc_roberta, axis=1)
bbc_roberta_concat.shape
pd.DataFrame(classic4_roberta_concat).to_hdf('./drive/MyDrive/MLDS/CoClust/classic4_roberta.h5', key='df', mode='w')
pd.DataFrame(bbc_roberta_concat).to_hdf('./drive/MyDrive/MLDS/CoClust/bbc_roberta.h5', key='df', mode='w')
pip install sentencepiece
model_name = 'albert-base-v2'
modelAlbert = AlbertModel.from_pretrained(model_name, output_hidden_states=True)
tokenizerAlbert = AlbertTokenizer.from_pretrained(model_name)
if device == 'cuda':
    modelAlbert.cuda()  # was modelRoberta.cuda(), a copy-paste slip
classic4_albert = encode_sentences_batch(modelAlbert, tokenizerAlbert, texts, batch_size=32)
# concatenated embeddings (the original cell reused the roberta variables
# and file names here by mistake; corrected to the albert ones)
classic4_albert_concat = np.concatenate(classic4_albert, axis=1)
classic4_albert_concat.shape
bbc_albert = encode_sentences_batch(modelAlbert, tokenizerAlbert, texts_bbc, batch_size=32)
# concatenated embeddings
bbc_albert_concat = np.concatenate(bbc_albert, axis=1)
bbc_albert_concat.shape
pd.DataFrame(classic4_albert_concat).to_hdf('./drive/MyDrive/MLDS/CoClust/classic4_albert.h5', key='df', mode='w')
pd.DataFrame(bbc_albert_concat).to_hdf('./drive/MyDrive/MLDS/CoClust/bbc_albert.h5', key='df', mode='w')
from torch.utils.data import DataLoader
import numpy as np
import torch
from tqdm import tqdm
class AutoEncoder(torch.nn.Module):
    def __init__(self, input_dim, embedding_dim):
        super().__init__()
        self.input_dim = input_dim
        self.embedding_dim = embedding_dim
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(self.input_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, self.embedding_dim)
        )
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(self.embedding_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, self.input_dim),
            torch.nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

def autoencoder(X, embedding_dim, n_epochs=50, batch_size=64, learning_rate=1e-3, weight_decay=1e-8, seed=None, return_model=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    if seed is not None:
        torch.manual_seed(seed)
    dataloader = DataLoader(dataset=X, batch_size=batch_size, shuffle=True)
    model = AutoEncoder(input_dim=X.shape[1], embedding_dim=embedding_dim)
    if device == 'cuda':
        model.cuda()
    loss_function = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=learning_rate,
                                 weight_decay=weight_decay)
    losses = []
    model.train()
    for epoch in tqdm(range(n_epochs)):
        for batch in dataloader:
            batch = batch.to(device)
            reconstructed = model(batch)
            loss = loss_function(reconstructed, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # store a float, not the tensor, to avoid keeping the graph alive
            losses.append(loss.item())
    model.eval()
    tensor_X = torch.tensor(X).to(device)
    with torch.no_grad():
        encoded_X = model.encoder(tensor_X)
    encoded_X = encoded_X.cpu().numpy()
    if return_model:
        return encoded_X, model, losses
    else:
        return encoded_X
time: 1.96 ms (started: 2021-12-05 13:16:33 +00:00)
pip install umap-learn
Requirement already satisfied: umap-learn in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (0.5.2) Requirement already satisfied: scipy>=1.0 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from umap-learn) (1.7.3) Requirement already satisfied: numba>=0.49 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from umap-learn) (0.54.1) Requirement already satisfied: pynndescent>=0.5 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from umap-learn) (0.5.5) Requirement already satisfied: scikit-learn>=0.22 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from umap-learn) (1.0.1) Requirement already satisfied: tqdm in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from umap-learn) (4.62.3) Requirement already satisfied: numpy>=1.17 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from umap-learn) (1.20.0) Requirement already satisfied: setuptools in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from numba>=0.49->umap-learn) (59.2.0) Requirement already satisfied: llvmlite<0.38,>=0.37.0rc1 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from numba>=0.49->umap-learn) (0.37.0) Requirement already satisfied: joblib>=0.11 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from pynndescent>=0.5->umap-learn) (1.1.0) Requirement already satisfied: threadpoolctl>=2.0.0 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from scikit-learn>=0.22->umap-learn) (3.0.0) Note: you may need to restart the kernel to use updated packages. time: 1.52 s (started: 2021-12-04 23:20:34 +00:00)
pip install hdbscan
Requirement already satisfied: hdbscan in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (0.8.27) Requirement already satisfied: scipy>=1.0 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from hdbscan) (1.7.3) Requirement already satisfied: cython>=0.27 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from hdbscan) (0.29.24) Requirement already satisfied: six in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from hdbscan) (1.16.0) Requirement already satisfied: joblib>=1.0 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from hdbscan) (1.1.0) Requirement already satisfied: scikit-learn>=0.20 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from hdbscan) (1.0.1) Requirement already satisfied: numpy>=1.16 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from hdbscan) (1.20.0) Requirement already satisfied: threadpoolctl>=2.0.0 in /home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages (from scikit-learn>=0.20->hdbscan) (3.0.0) Note: you may need to restart the kernel to use updated packages. time: 1.44 s (started: 2021-12-04 23:20:35 +00:00)
import hdbscan
time: 241 ms (started: 2021-12-05 13:16:37 +00:00)
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score, adjusted_rand_score
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from umap import UMAP
from utils_py import accuracy
time: 6.19 s (started: 2021-12-05 13:16:37 +00:00)
def map_labels(labels):
    mapping = {}
    for (i, name) in enumerate(set(labels)):
        mapping[name] = i
    return [mapping[letter] for letter in labels]
time: 505 µs (started: 2021-12-05 13:16:43 +00:00)
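As a side note, scikit-learn's `LabelEncoder` performs the same job as `map_labels` above, with a deterministic (sorted) ordering of the label names. This is an illustrative sketch, not part of the notebook's pipeline:

```python
# Sketch: LabelEncoder maps string labels to integer ids, like map_labels,
# but with a deterministic ordering (labels sorted alphabetically).
from sklearn.preprocessing import LabelEncoder

labels = ["cacm", "med", "cisi", "med", "cacm"]
encoded = LabelEncoder().fit_transform(labels)
print(encoded)  # [0 2 1 2 0]: cacm -> 0, cisi -> 1, med -> 2
```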
def eval_clustering_2D(x, labels, pred_labels, methods):
    n_plots = len(pred_labels) + 1
    n_rows = n_plots // 3 if n_plots % 3 == 0 else n_plots // 3 + 1
    fig, axes = plt.subplots(n_rows, 3, figsize=(20, 25))
    axes = [item for sublist in axes for item in sublist]
    axes[0].scatter(x[:, 0], x[:, 1], c=labels, edgecolor='none', alpha=1)
    axes[0].title.set_text('Real Labels')
    results = {}
    for i in range(len(pred_labels)):
        nmi = normalized_mutual_info_score(labels, pred_labels[i])
        ari = adjusted_rand_score(labels, pred_labels[i])
        acc = accuracy(labels, pred_labels[i])
        results[methods[i]] = (nmi, ari, acc)
        axes[i+1].scatter(x[:, 0], x[:, 1], c=pred_labels[i], edgecolor='none', alpha=1)
        axes[i+1].title.set_text(f'{methods[i]}\nNMI = {nmi}\nARI={ari}\nAcc={acc}')
    plt.show()
    return results
time: 3.94 ms (started: 2021-12-05 13:16:43 +00:00)
# def run_clustering(X, k, labels):
# kmeans_labels = KMeans(k, random_state=42).fit(X).labels_
# spectral_labels = SpectralClustering(k, n_components= X.shape[1], assign_labels="discretize", random_state=42).fit(X).labels_
# hdbscan_labels = hdbscan.HDBSCAN(algorithm="best", alpha=1.0, leaf_size=40, cluster_selection_method="eom", metric="euclidean").fit(X).labels_
# cah_labels_ward = AgglomerativeClustering(n_clusters=k).fit_predict(X)
# cah_labels_complete = AgglomerativeClustering(n_clusters=k, linkage="complete").fit_predict(X)
# cah_labels_average = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
# cah_labels_single = AgglomerativeClustering(n_clusters=k, linkage="single").fit_predict(X)
# eval_clustering_2D(X, labels, [kmeans_labels, spectral_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single], ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)"])
time: 478 µs (started: 2021-12-05 13:16:43 +00:00)
def run_clustering(X, k, labels, spectral=False):
    kmeans_labels = KMeans(k, random_state=42).fit(X).labels_
    hdbscan_labels = hdbscan.HDBSCAN(algorithm="best", alpha=1.0, leaf_size=40, cluster_selection_method="eom", metric="euclidean").fit(X).labels_
    cah_labels_ward = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    cah_labels_complete = AgglomerativeClustering(n_clusters=k, linkage="complete").fit_predict(X)
    cah_labels_average = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    cah_labels_single = AgglomerativeClustering(n_clusters=k, linkage="single").fit_predict(X)
    gaussian = GaussianMixture(n_components=k, random_state=42).fit_predict(X)
    if spectral:
        spectral_labels = SpectralClustering(k, assign_labels="discretize", random_state=42).fit(X).labels_
        return eval_clustering_2D(X, labels, [kmeans_labels, spectral_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single, gaussian], ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)", "CAH (Average)", "CAH (Single)", "MMG"])
    else:
        return eval_clustering_2D(X, labels, [kmeans_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single, gaussian], ["Kmeans", "HDBSCAN", "CAH (Ward)", "CAH (Complete)", "CAH (Average)", "CAH (Single)", "MMG"])
time: 1.17 ms (started: 2021-12-05 13:16:43 +00:00)
methods = ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)"]
methods_noSpec = ["Kmeans", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)"]
time: 471 µs (started: 2021-12-05 13:16:43 +00:00)
results_classic4_redim = {}
results_classic4_glove_redim = {}
time: 404 µs (started: 2021-12-05 13:28:24 +00:00)
df.groupby("label").size()
label
cacm    3204
cisi    1460
cran    1398
med     1033
dtype: int64
time: 8.45 ms (started: 2021-12-05 12:04:35 +00:00)
We can see that the classes are not the same size, which makes accuracy an inappropriate metric for evaluating the clustering methods.
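To make this concrete, here is a minimal sketch of the accuracy a trivial one-cluster baseline would reach, using the class sizes printed above. The variable names are illustrative, not part of the notebook:

```python
# Sketch: on imbalanced classes, a degenerate clustering that puts every
# document in a single cluster already scores ~45% accuracy, because that
# cluster gets matched to the majority class ("cacm").
class_sizes = {"cacm": 3204, "cisi": 1460, "cran": 1398, "med": 1033}
n_total = sum(class_sizes.values())

majority_accuracy = max(class_sizes.values()) / n_total
print(f"majority-class baseline accuracy: {majority_accuracy:.3f}")
```

Any accuracy near this baseline therefore carries no evidence that the clustering found real structure.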
X_reduced = PCA(n_components=2, random_state=42).fit_transform(classic4_word2vec)
kmeans_labels = KMeans(k_classic4, random_state=42).fit(classic4_word2vec).labels_
spectral_labels = SpectralClustering(k_classic4, n_components= classic4_word2vec.shape[1], assign_labels="discretize", random_state=42).fit(classic4_word2vec).labels_
hdbscan_labels = hdbscan.HDBSCAN(algorithm="best", alpha=1.0, leaf_size=40, cluster_selection_method="eom", metric="euclidean").fit(classic4_word2vec).labels_
cah_labels_ward = AgglomerativeClustering(n_clusters=k_classic4).fit_predict(classic4_word2vec)
cah_labels_complete = AgglomerativeClustering(n_clusters=k_classic4, linkage="complete").fit_predict(classic4_word2vec)
cah_labels_average = AgglomerativeClustering(n_clusters=k_classic4, linkage="average").fit_predict(classic4_word2vec)
cah_labels_single = AgglomerativeClustering(n_clusters=k_classic4, linkage="single").fit_predict(classic4_word2vec)
gaussian = GaussianMixture(n_components=k_classic4, random_state=42).fit_predict(classic4_word2vec)
time: 3min 22s (started: 2021-12-05 12:04:35 +00:00)
results_classic4_redim["original"] = eval_clustering_2D(X_reduced, map_labels(classic4_labels), [kmeans_labels, spectral_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single, gaussian], ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)", "MMG"])
time: 1.52 s (started: 2021-12-05 12:07:58 +00:00)
By visualizing the space with a 2-component PCA, we can see that the classes are hard to separate.
The results above show that the clustering methods used are unable to separate the classes in their original space, since most of them score close to a random labeling (ARI close to 0 or below). Interestingly, the MMG (Gaussian mixture) does better than the other models, but still brings no significant improvement over them.
We can also note that accuracy is not the best metric for evaluating the clustering methods, since the "cacm" class has the most individuals, which confirms the hypothesis we raised earlier.
We will not run spectral clustering in the following experiments because of its high running time.
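As a sanity check on the "ARI close to 0 means close to random" reading, this small sketch scores a purely random labeling with the same `adjusted_rand_score` the notebook uses; it is illustrative only:

```python
# Sketch: ARI is chance-adjusted, so an independent random labeling
# scores near 0 (and can be slightly negative), regardless of k.
import numpy as np
from sklearn.metrics.cluster import adjusted_rand_score

rng = np.random.default_rng(42)
true = rng.integers(0, 4, size=5000)
random_pred = rng.integers(0, 4, size=5000)

ari = adjusted_rand_score(true, random_pred)
print(f"ARI of a random labeling: {ari:.4f}")  # close to 0
```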
PCA with 2 components
X_reduced = PCA(n_components=2, random_state=42).fit_transform(classic4_word2vec)
results_classic4_redim['pca2'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 7.76 s (started: 2021-12-05 12:07:59 +00:00)
PCA with 20 components
X_reduced = PCA(n_components=20, random_state=42).fit_transform(classic4_word2vec)
results_classic4_redim['pca20'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 11.4 s (started: 2021-12-05 12:08:07 +00:00)
We can see that using a PCA with 2 or 20 components does not help improve the performance of the clustering methods, since the space into which we project the data is not good at separating the classes.
Just as in the original space, the MMG does better than the other methods.
Going from 2 to 20 components does not improve performance, except for the MMG, whose ARI and NMI improve significantly.
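One way to reason about the 2-vs-20 component choice is the cumulative explained variance of the PCA. The sketch below uses synthetic stand-in data (`X` is not `classic4_word2vec`), so the numbers are illustrative only:

```python
# Sketch: cumulative explained variance as a guide for choosing the number
# of PCA components. X is a random stand-in for the embedding matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 300))  # stand-in for a 300-dim embedding matrix

pca = PCA(n_components=20, random_state=42).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(f"variance kept by  2 components: {cumvar[1]:.3f}")
print(f"variance kept by 20 components: {cumvar[19]:.3f}")
```

When only a small fraction of the variance is kept, a clustering method has little signal to work with, which is consistent with the weak results above.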
TSNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(classic4_word2vec)
results_classic4_redim['tsne'] = run_clustering(X_reduced_tsne, k_classic4, map_labels(classic4_labels))
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2. warnings.warn(
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2. warnings.warn(
time: 49.9 s (started: 2021-12-05 12:08:19 +00:00)
The performance difference between the PCA and t-SNE projections varies across models.
Accuracy for Kmeans, CAH, and the MMG is the same (75%). If we use NMI to compare the models, the MMG is still the best-performing one, but by a small margin.
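A useful property behind using NMI here is that it is invariant to relabeling the clusters, so cluster ids never need to be aligned to class ids before comparing. A small illustrative sketch:

```python
# Sketch: NMI is invariant under a permutation of cluster ids, so it
# compares partitions directly, without matching clusters to classes.
import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
pred = np.array([2, 2, 0, 0, 3, 3, 1, 1])  # same partition, ids permuted

nmi = normalized_mutual_info_score(labels, pred)
print(f"NMI after relabeling: {nmi:.1f}")  # 1.0: the partitions are identical
```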
UMAP
X_reduced = UMAP(n_components=2, random_state=42).fit_transform(classic4_word2vec)
time: 22 s (started: 2021-12-05 12:09:08 +00:00)
results_classic4_redim['umap'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 6.5 s (started: 2021-12-05 12:09:30 +00:00)
X_reduced = UMAP(n_components=20, random_state=42).fit_transform(classic4_word2vec)
time: 25.3 s (started: 2021-12-05 12:35:57 +00:00)
results_classic4_redim['umap20'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 8.01 s (started: 2021-12-05 12:36:22 +00:00)
The space produced by applying UMAP to the data considerably improved the performance of all models relative to the original space.
Autoencoder
n_components=2
X_reduced = autoencoder(classic4_word2vec.astype("float32"), n_components, seed=42, learning_rate=1e-3)
100%|██████████| 50/50 [00:13<00:00, 3.65it/s]
time: 13.8 s (started: 2021-12-05 12:38:21 +00:00)
results_classic4_redim['autoencoder'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 6.27 s (started: 2021-12-05 12:38:35 +00:00)
We tested different architectures for the autoencoder:
From the results of applying the different clustering methods to the space produced by the autoencoder, we can see:
from copy import deepcopy
results_temp = deepcopy(results_classic4_redim)
for method in results_temp:
    for result in results_temp[method]:
        nmi, ari, acc = results_temp[method][result]
        results_temp[method][result] = {
            'nmi': nmi,
            'ari': ari,
            'acc': acc
        }
time: 1.06 ms (started: 2021-12-05 12:38:41 +00:00)
results_word2vec_classic4 = pd.DataFrame.from_dict({(i, j): results_temp[i][j]
                                                    for i in results_temp.keys()
                                                    for j in results_temp[i].keys()},
                                                   orient='index')
time: 2.91 ms (started: 2021-12-05 12:38:41 +00:00)
Table of metrics
results_word2vec_classic4
| | | nmi | ari | acc |
|---|---|---|---|---|
| original | Kmeans | 0.218916 | -0.049318 | 0.330937 |
| | Spectral Clustering | 0.156745 | -0.104473 | 0.238196 |
| | HDBSCAN | 0.204511 | -0.028608 | 0.384073 |
| | CAH (Ward) | 0.221556 | -0.049277 | 0.321212 |
| | CAH (Complete) | 0.002416 | -0.001646 | 0.449612 |
| | CAH (Average) | 0.000522 | -0.000354 | 0.451163 |
| | CAH (Single) | 0.000522 | -0.000354 | 0.451163 |
| | MMG | 0.467672 | 0.185057 | 0.500775 |
| pca2 | Kmeans | 0.218993 | -0.048336 | 0.335870 |
| | HDBSCAN | 0.175478 | -0.078413 | 0.288795 |
| | CAH (Ward) | 0.215684 | -0.052603 | 0.323467 |
| | CAH (Complete) | 0.218577 | -0.031889 | 0.377308 |
| | CAH (Average) | 0.227074 | -0.011426 | 0.402537 |
| | CAH (Single) | 0.011944 | -0.008089 | 0.441579 |
| | MMG | 0.219745 | -0.043005 | 0.327414 |
| pca20 | Kmeans | 0.218894 | -0.049439 | 0.330233 |
| | HDBSCAN | 0.206966 | -0.022090 | 0.390416 |
| | CAH (Ward) | 0.225017 | -0.031673 | 0.374066 |
| | CAH (Complete) | 0.061478 | -0.037469 | 0.395490 |
| | CAH (Average) | 0.012434 | -0.008417 | 0.441156 |
| | CAH (Single) | 0.000522 | -0.000354 | 0.451163 |
| | MMG | 0.574315 | 0.377036 | 0.586469 |
| tsne | Kmeans | 0.619635 | 0.450235 | 0.752925 |
| | HDBSCAN | 0.351751 | 0.145171 | 0.366737 |
| | CAH (Ward) | 0.628455 | 0.480589 | 0.782805 |
| | CAH (Complete) | 0.568745 | 0.411789 | 0.729528 |
| | CAH (Average) | 0.681420 | 0.496782 | 0.753206 |
| | CAH (Single) | 0.461908 | 0.239177 | 0.585060 |
| | MMG | 0.685889 | 0.502425 | 0.759831 |
| umap | Kmeans | 0.681254 | 0.493522 | 0.749965 |
| | HDBSCAN | 0.534970 | 0.378280 | 0.607752 |
| | CAH (Ward) | 0.666850 | 0.492157 | 0.750951 |
| | CAH (Complete) | 0.719314 | 0.581389 | 0.772234 |
| | CAH (Average) | 0.445825 | 0.235151 | 0.581536 |
| | CAH (Single) | 0.015430 | -0.010597 | 0.438478 |
| | MMG | 0.661623 | 0.490647 | 0.751515 |
| autoencoder | Kmeans | 0.475146 | 0.288165 | 0.516843 |
| | HDBSCAN | 0.171076 | -0.082856 | 0.307259 |
| | CAH (Ward) | 0.496066 | 0.295059 | 0.521353 |
| | CAH (Complete) | 0.354249 | 0.189827 | 0.553347 |
| | CAH (Average) | 0.238456 | 0.021460 | 0.433263 |
| | CAH (Single) | 0.012113 | -0.007888 | 0.441860 |
| | MMG | 0.701295 | 0.508924 | 0.764623 |
| umap20 | Kmeans | 0.757409 | 0.596716 | 0.778013 |
| | HDBSCAN | 0.494973 | 0.362231 | 0.569415 |
| | CAH (Ward) | 0.757945 | 0.596690 | 0.778154 |
| | CAH (Complete) | 0.015430 | -0.010597 | 0.438478 |
| | CAH (Average) | 0.013849 | -0.009463 | 0.439887 |
| | CAH (Single) | 0.013849 | -0.009463 | 0.439887 |
| | MMG | 0.721141 | 0.509090 | 0.749260 |
time: 11.2 ms (started: 2021-12-05 12:38:41 +00:00)
The best methods according to the 3 metrics
metric = "ari"
results_word2vec_classic4[results_word2vec_classic4[metric] == results_word2vec_classic4[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap20 | Kmeans | 0.757409 | 0.596716 | 0.778013 |
time: 6.72 ms (started: 2021-12-05 12:38:41 +00:00)
metric = "nmi"
results_word2vec_classic4[results_word2vec_classic4[metric] == results_word2vec_classic4[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap20 | CAH (Ward) | 0.757945 | 0.59669 | 0.778154 |
time: 6.56 ms (started: 2021-12-05 12:38:41 +00:00)
metric = "acc"
results_word2vec_classic4[results_word2vec_classic4[metric] == results_word2vec_classic4[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| tsne | CAH (Ward) | 0.628455 | 0.480589 | 0.782805 |
time: 6.87 ms (started: 2021-12-05 12:38:41 +00:00)
The best compromise is CAH (Ward) with a 20-component UMAP.
X_reduced_umap = UMAP(n_components=20, random_state=42).fit_transform(classic4_word2vec)
best_labels = AgglomerativeClustering(n_clusters=k_classic4).fit_predict(X_reduced_umap)
time: 27.7 s (started: 2021-12-05 13:17:55 +00:00)
from wordcloud import WordCloud, STOPWORDS
def print_wordcloud(X, labels):
    temp_df = pd.DataFrame({
        "text": X,
        "labels": labels
    })
    for label in temp_df['labels'].unique():
        alltext = ' '.join(temp_df[temp_df['labels'] == label]['text'])
        wordcloud = WordCloud().generate(alltext)
        # Display the generated image:
        print(f'Class: {label}')
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()
        print('\n')
time: 37.9 ms (started: 2021-12-05 13:18:22 +00:00)
Visualizing the wordclouds of the best clustering
print_wordcloud(df['text'].values, best_labels)
Class: 0
Class: 1
Class: 2
Class: 3
time: 3.73 s (started: 2021-12-05 13:18:22 +00:00)
print_wordcloud(df['text'].values, df['label'].values)
Class: cacm
Class: cisi
Class: med
Class: cran
time: 3.73 s (started: 2021-12-05 13:18:26 +00:00)
By visualizing and comparing the wordclouds, we can see that the predicted classes are as follows:
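Instead of eyeballing wordclouds, predicted clusters can also be matched to true classes automatically with the Hungarian algorithm on the contingency table. The sketch below uses toy arrays; in the notebook, `best_labels` and `df['label'].values` would play the roles of `pred` and `true`:

```python
# Sketch: map predicted cluster ids to true class names by maximizing the
# matched counts in the contingency matrix (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

true = np.array(["a", "a", "b", "b", "c", "c"])
pred = np.array([1, 1, 2, 2, 0, 0])

cont = contingency_matrix(true, pred)      # rows: classes, cols: clusters
row, col = linear_sum_assignment(-cont)    # negate to maximize matches
classes = np.unique(true)
mapping = {cluster: klass for klass, cluster in zip(classes[row], col)}
print(mapping)  # cluster id -> matched class name
```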
X_reduced = PCA(n_components=2, random_state=42).fit_transform(classic4_glove)
kmeans_labels = KMeans(k_classic4, random_state=42).fit(classic4_glove).labels_
spectral_labels = SpectralClustering(k_classic4, n_components= classic4_glove.shape[1], assign_labels="discretize", random_state=42).fit(classic4_glove).labels_
hdbscan_labels = hdbscan.HDBSCAN(algorithm="best", alpha=1.0, leaf_size=40, cluster_selection_method="eom", metric="euclidean").fit(classic4_glove).labels_
cah_labels_ward = AgglomerativeClustering(n_clusters=k_classic4).fit_predict(classic4_glove)
cah_labels_complete = AgglomerativeClustering(n_clusters=k_classic4, linkage="complete").fit_predict(classic4_glove)
cah_labels_average = AgglomerativeClustering(n_clusters=k_classic4, linkage="average").fit_predict(classic4_glove)
cah_labels_single = AgglomerativeClustering(n_clusters=k_classic4, linkage="single").fit_predict(classic4_glove)
gaussian = GaussianMixture(n_components=k_classic4, random_state=42).fit_predict(classic4_glove)
time: 1min 54s (started: 2021-12-05 13:25:57 +00:00)
results_classic4_glove_redim["original"] = eval_clustering_2D(X_reduced, map_labels(classic4_labels), [kmeans_labels, spectral_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single, gaussian], ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)", "MMG"])
time: 1.71 s (started: 2021-12-05 13:28:34 +00:00)
By visualizing the space with a 2-component PCA, we can see that the classes are hard to separate.
The results above show that the clustering methods used are unable to separate the classes in their original space, since most of them score close to a random labeling (ARI close to 0 or below). Interestingly, the Kmeans, CAH (Ward), and MMG models do better than the others, but still not by enough of a margin to support interpretation.
We can also note that accuracy is not the best metric for evaluating the clustering methods, since the "cacm" class has the most individuals, which confirms the hypothesis we raised earlier.
PCA with 2 components
X_reduced = PCA(n_components=2, random_state=42).fit_transform(classic4_glove)
results_classic4_glove_redim['pca2'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 7.61 s (started: 2021-12-05 13:36:17 +00:00)
PCA with 20 components
X_reduced = PCA(n_components=20, random_state=42).fit_transform(classic4_glove)
results_classic4_glove_redim['pca20'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 10.2 s (started: 2021-12-05 13:36:25 +00:00)
We can see that using a PCA with 2 or 20 components degraded the performance of the clustering methods, since the space into which we project the data is not good at separating the classes.
Going from 2 to 20 components does not improve performance.
Kmeans, CAH (Ward), and the MMG are the only ones with a positive, though weak, ARI.
TSNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(classic4_glove)
results_classic4_glove_redim['tsne'] = run_clustering(X_reduced_tsne, k_classic4, map_labels(classic4_labels))
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2. warnings.warn(
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2. warnings.warn(
time: 52.3 s (started: 2021-12-05 13:36:35 +00:00)
The performance difference between the PCA and t-SNE projections varies across models.
Accuracy for Kmeans, CAH, and the MMG is the same (77%). If we use NMI to compare the models, the CAH models edge out the MMG, but by a small margin.
UMAP
X_reduced = UMAP(n_components=2, random_state=42).fit_transform(classic4_glove)
time: 12 s (started: 2021-12-05 13:37:27 +00:00)
results_classic4_glove_redim['umap'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 6.15 s (started: 2021-12-05 13:37:39 +00:00)
X_reduced = UMAP(n_components=20, random_state=42).fit_transform(classic4_glove)
time: 25 s (started: 2021-12-05 13:37:46 +00:00)
results_classic4_glove_redim['umap20'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 8.03 s (started: 2021-12-05 13:38:11 +00:00)
The space produced by applying UMAP to the data considerably improved the performance of all models relative to the original space.
Autoencoder
n_components=2
X_reduced = autoencoder(classic4_glove.astype("float32"), n_components, seed=42, learning_rate=1e-3)
100%|██████████| 50/50 [00:13<00:00, 3.74it/s]
time: 13.4 s (started: 2021-12-05 13:38:19 +00:00)
results_classic4_glove_redim['autoencoder'] = run_clustering(X_reduced, k_classic4, map_labels(classic4_labels))
time: 6.35 s (started: 2021-12-05 13:38:32 +00:00)
from copy import deepcopy
results_temp = deepcopy(results_classic4_glove_redim)
for method in results_temp:
    for result in results_temp[method]:
        nmi, ari, acc = results_temp[method][result]
        results_temp[method][result] = {
            'nmi': nmi,
            'ari': ari,
            'acc': acc
        }
time: 1.61 ms (started: 2021-12-05 13:38:38 +00:00)
results_glove_classic4 = pd.DataFrame.from_dict({(i, j): results_temp[i][j]
                                                 for i in results_temp.keys()
                                                 for j in results_temp[i].keys()},
                                                orient='index')
time: 3.05 ms (started: 2021-12-05 13:38:38 +00:00)
Table of metrics
results_glove_classic4
| | | nmi | ari | acc |
|---|---|---|---|---|
| original | Kmeans | 0.546820 | 0.334095 | 0.515152 |
| | Spectral Clustering | 0.152768 | -0.106210 | 0.248203 |
| | HDBSCAN | 0.208170 | -0.023696 | 0.390134 |
| | CAH (Ward) | 0.579104 | 0.354563 | 0.539112 |
| | CAH (Complete) | 0.000868 | -0.000590 | 0.450881 |
| | CAH (Average) | 0.000695 | -0.000472 | 0.451022 |
| | CAH (Single) | 0.000522 | -0.000354 | 0.451163 |
| | MMG | 0.469534 | 0.171702 | 0.461029 |
| pca2 | Kmeans | 0.512299 | 0.311471 | 0.520226 |
| | HDBSCAN | 0.178165 | -0.074093 | 0.321917 |
| | CAH (Ward) | 0.562863 | 0.344484 | 0.517125 |
| | CAH (Complete) | 0.213466 | -0.043428 | 0.322622 |
| | CAH (Average) | 0.224198 | -0.038851 | 0.353911 |
| | CAH (Single) | 0.001214 | -0.000825 | 0.450599 |
| | MMG | 0.542993 | 0.334099 | 0.526004 |
| pca20 | Kmeans | 0.545204 | 0.333182 | 0.515011 |
| | HDBSCAN | 0.210413 | -0.025683 | 0.387738 |
| | CAH (Ward) | 0.541665 | 0.331599 | 0.539958 |
| | CAH (Complete) | 0.081665 | -0.046789 | 0.375053 |
| | CAH (Average) | 0.001730 | -0.001178 | 0.450176 |
| | CAH (Single) | 0.000695 | -0.000472 | 0.451022 |
| | MMG | 0.486936 | 0.189055 | 0.480197 |
| tsne | Kmeans | 0.645336 | 0.462023 | 0.768006 |
| | HDBSCAN | 0.348307 | 0.141324 | 0.356730 |
| | CAH (Ward) | 0.736415 | 0.526349 | 0.769274 |
| | CAH (Complete) | 0.560571 | 0.417842 | 0.566737 |
| | CAH (Average) | 0.735323 | 0.525357 | 0.768851 |
| | CAH (Single) | 0.470281 | 0.249963 | 0.595208 |
| | MMG | 0.727456 | 0.525956 | 0.770120 |
| umap | Kmeans | 0.733938 | 0.526762 | 0.771106 |
| | HDBSCAN | 0.567488 | 0.395996 | 0.626216 |
| | CAH (Ward) | 0.733938 | 0.526762 | 0.771106 |
| | CAH (Complete) | 0.733938 | 0.526762 | 0.771106 |
| | CAH (Average) | 0.520546 | 0.244868 | 0.574348 |
| | CAH (Single) | 0.463947 | 0.246574 | 0.592953 |
| | MMG | 0.733433 | 0.525480 | 0.769556 |
| umap20 | Kmeans | 0.732785 | 0.525912 | 0.770684 |
| | HDBSCAN | 0.580139 | 0.399672 | 0.633263 |
| | CAH (Ward) | 0.732785 | 0.525912 | 0.770684 |
| | CAH (Complete) | 0.520676 | 0.245184 | 0.574771 |
| | CAH (Average) | 0.463068 | 0.246204 | 0.592812 |
| | CAH (Single) | 0.463068 | 0.246204 | 0.592812 |
| | MMG | 0.732785 | 0.525912 | 0.770684 |
| autoencoder | Kmeans | 0.520456 | 0.337159 | 0.681748 |
| | HDBSCAN | 0.177203 | -0.079225 | 0.315997 |
| | CAH (Ward) | 0.527722 | 0.346146 | 0.694996 |
| | CAH (Complete) | 0.400839 | 0.236166 | 0.510078 |
| | CAH (Average) | 0.234938 | 0.003085 | 0.416209 |
| | CAH (Single) | 0.001214 | -0.000825 | 0.450599 |
| | MMG | 0.470972 | 0.289866 | 0.657082 |
time: 10.5 ms (started: 2021-12-05 13:38:38 +00:00)
The best methods according to the 3 metrics
metric = "ari"
results_glove_classic4[results_glove_classic4[metric] == results_glove_classic4[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap | Kmeans | 0.733938 | 0.526762 | 0.771106 |
| | CAH (Ward) | 0.733938 | 0.526762 | 0.771106 |
| | CAH (Complete) | 0.733938 | 0.526762 | 0.771106 |
time: 6.78 ms (started: 2021-12-05 13:48:54 +00:00)
metric = "nmi"
results_glove_classic4[results_glove_classic4[metric] == results_glove_classic4[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| tsne | CAH (Ward) | 0.736415 | 0.526349 | 0.769274 |
time: 8.96 ms (started: 2021-12-05 13:48:55 +00:00)
metric = "acc"
best_glove_classic4 = results_glove_classic4[results_glove_classic4[metric] == results_glove_classic4[metric].max()]
best_glove_classic4
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap | Kmeans | 0.733938 | 0.526762 | 0.771106 |
| | CAH (Ward) | 0.733938 | 0.526762 | 0.771106 |
| | CAH (Complete) | 0.733938 | 0.526762 | 0.771106 |
time: 7.01 ms (started: 2021-12-05 16:06:02 +00:00)
The best model is CAH (Ward) with a 2-component UMAP or t-SNE.
X_reduced_best = UMAP(n_components=2, random_state=42).fit_transform(classic4_glove)
best_labels = AgglomerativeClustering(n_clusters=k_classic4).fit_predict(X_reduced_best)
time: 13.7 s (started: 2021-12-05 13:50:58 +00:00)
Visualizing the wordclouds of the best clustering
print_wordcloud(df['text'].values, best_labels)
Class: 0
Class: 2
Class: 1
Class: 3
time: 3.92 s (started: 2021-12-05 13:51:12 +00:00)
print_wordcloud(df['text'].values, df['label'].values)
Class: cacm
Class: cisi
Class: med
Class: cran
time: 3.91 s (started: 2021-12-05 13:51:16 +00:00)
By visualizing and comparing the wordclouds, we can see that the predicted classes are as follows:
results_bbc_redim = {}
results_bbc_glove_redim = {}
time: 542 µs (started: 2021-12-05 15:20:41 +00:00)
df_bbc.groupby("label").size()
label
business         510
entertainment    386
politics         417
sport            511
tech             401
dtype: int64
time: 4.57 ms (started: 2021-12-05 15:20:41 +00:00)
We can see that the classes are more or less the same size.
X_reduced = PCA(n_components=2, random_state=42).fit_transform(bbc_word2vec)
kmeans_labels = KMeans(k_bbc, random_state=42).fit(bbc_word2vec).labels_
spectral_labels = SpectralClustering(k_bbc, n_components= bbc_word2vec.shape[1], assign_labels="discretize", random_state=42).fit(bbc_word2vec).labels_
hdbscan_labels = hdbscan.HDBSCAN(algorithm="best", alpha=1.0, leaf_size=40, cluster_selection_method="eom", metric="euclidean").fit(bbc_word2vec).labels_
cah_labels_ward = AgglomerativeClustering(n_clusters=k_bbc).fit_predict(bbc_word2vec)
cah_labels_complete = AgglomerativeClustering(n_clusters=k_bbc, linkage="complete").fit_predict(bbc_word2vec)
cah_labels_average = AgglomerativeClustering(n_clusters=k_bbc, linkage="average").fit_predict(bbc_word2vec)
cah_labels_single = AgglomerativeClustering(n_clusters=k_bbc, linkage="single").fit_predict(bbc_word2vec)
gaussian = GaussianMixture(n_components=k_bbc, random_state=42).fit_predict(bbc_word2vec)
time: 13.8 s (started: 2021-12-05 15:20:41 +00:00)
results_bbc_redim["original"] = eval_clustering_2D(X_reduced, map_labels(bbc_labels), [kmeans_labels, spectral_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single, gaussian], ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)", "MMG"])
time: 1.27 s (started: 2021-12-05 15:20:55 +00:00)
Visualizing the space with a 2-dimensional PCA, we can see that the classes are moderately hard to separate.
The results above show that Spectral Clustering, HDBSCAN, and CAH with average, single, and complete linkage are unable to separate the classes in the original space: most of them score close to a random label assignment (ARI near or below 0).
Interestingly, KMeans, MMG (Gaussian mixture), and CAH (Ward) do better than the other models, with MMG edging out KMeans and CAH at an NMI of 0.79, an ARI of 0.80, and an accuracy of 91%.
We can also note that accuracy is a reasonable metric for evaluating the clusterings on the BBC dataset.
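For reference, the three metrics reported throughout can be computed with scikit-learn; a minimal sketch on toy labels (note that, unlike NMI and ARI, accuracy is not invariant to a permutation of the cluster ids, so the ids must be remapped first):

```python
import numpy as np
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, accuracy_score)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])  # same partition, permuted ids

# NMI and ARI are permutation-invariant: a relabelled perfect
# clustering still scores 1.0
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

# Accuracy needs the cluster ids remapped onto the classes (0 <-> 1 here)
remap = {1: 0, 0: 1, 2: 2}
acc = accuracy_score(y_true, [remap[p] for p in y_pred])
```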
We will not run spectral clustering on the t-SNE embedding because of its long run times.
PCA with 2 components
X_reduced = PCA(n_components=2, random_state=42).fit_transform(bbc_word2vec)
results_bbc_redim['pca2'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 2.7 s (started: 2021-12-05 15:20:56 +00:00)
PCA with 20 components
X_reduced = PCA(n_components=20, random_state=42).fit_transform(bbc_word2vec)
results_bbc_redim['pca20'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 3.33 s (started: 2021-12-05 15:20:59 +00:00)
TSNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(bbc_word2vec)
results_bbc_redim['tsne'] = run_clustering(X_reduced_tsne, k_bbc, map_labels(bbc_labels))
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
time: 12 s (started: 2021-12-05 15:21:02 +00:00)
UMAP
X_reduced = UMAP(n_components=2, random_state=42).fit_transform(bbc_word2vec)
time: 6.64 s (started: 2021-12-05 15:21:14 +00:00)
results_bbc_redim['umap'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 2.45 s (started: 2021-12-05 15:21:21 +00:00)
X_reduced = UMAP(n_components=20, random_state=42).fit_transform(bbc_word2vec)
time: 10.8 s (started: 2021-12-05 15:21:23 +00:00)
results_bbc_redim['umap0'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 3.04 s (started: 2021-12-05 15:21:34 +00:00)
Autoencoder
n_components=2
X_reduced = autoencoder(bbc_word2vec.astype("float32"), n_components, seed=42, learning_rate=1e-3)
100%|██████████| 50/50 [00:04<00:00, 12.03it/s]
time: 4.18 s (started: 2021-12-05 15:21:37 +00:00)
results_bbc_redim['autoencoder'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels))
time: 1.64 s (started: 2021-12-05 15:21:41 +00:00)
The space learned by the autoencoder separates the classes worse than the original space, which explains the poor performance of all the models in it.
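As an illustration of the idea behind this reducer, here is a tiny linear autoencoder trained by plain gradient descent in NumPy; the notebook's `autoencoder` function is presumably deeper and non-linear, so this is only a sketch under that assumption:

```python
import numpy as np

def linear_autoencoder(X, n_components=2, epochs=200, lr=1e-2, seed=42):
    """Encode X into n_components dims with a one-layer linear autoencoder."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, n_components))
    W_dec = rng.normal(scale=0.1, size=(n_components, d))
    for _ in range(epochs):
        Z = X @ W_enc              # encode
        X_hat = Z @ W_dec          # decode
        err = X_hat - X            # reconstruction error
        # Gradients of the mean squared reconstruction loss
        g_dec = Z.T @ err / n
        g_enc = X.T @ (err @ W_dec.T) / n
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    return X @ W_enc

X = np.random.default_rng(0).normal(size=(100, 16)).astype("float32")
X_reduced = linear_autoencoder(X, n_components=2)
```

A purely linear autoencoder like this one learns (up to rotation) the same subspace as PCA; the non-linearities of a real autoencoder are what could make it more expressive.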
from copy import deepcopy

results_temp = deepcopy(results_bbc_redim)
for method in results_temp:
    for result in results_temp[method]:
        nmi, ari, acc = results_temp[method][result]
        results_temp[method][result] = {
            'nmi': nmi,
            'ari': ari,
            'acc': acc
        }
time: 1.17 ms (started: 2021-12-05 15:33:29 +00:00)
results_word2vec_bbc = pd.DataFrame.from_dict(
    {(i, j): results_temp[i][j]
     for i in results_temp.keys()
     for j in results_temp[i].keys()},
    orient='index')
time: 3.58 ms (started: 2021-12-05 15:33:30 +00:00)
Metrics table
results_word2vec_bbc
| | | nmi | ari | acc |
|---|---|---|---|---|
| original | Kmeans | 0.796654 | 0.798345 | 0.915506 |
| | Spectral Clustering | 0.381393 | 0.031478 | 0.053483 |
| | HDBSCAN | 0.074266 | 0.028327 | 0.291685 |
| | CAH (Ward) | 0.792318 | 0.812097 | 0.916854 |
| | CAH (Complete) | 0.370294 | 0.274512 | 0.438202 |
| | CAH (Average) | 0.014482 | -0.000934 | 0.233708 |
| | CAH (Single) | 0.003276 | -0.000236 | 0.229663 |
| | MMG | 0.798214 | 0.800423 | 0.916404 |
| pca2 | Kmeans | 0.564920 | 0.525932 | 0.733483 |
| | Spectral Clustering | 0.553734 | 0.515027 | 0.727191 |
| | HDBSCAN | 0.022548 | 0.006957 | 0.254831 |
| | CAH (Ward) | 0.554063 | 0.489187 | 0.674157 |
| | CAH (Complete) | 0.434246 | 0.305676 | 0.608989 |
| | CAH (Average) | 0.517527 | 0.357733 | 0.577978 |
| | CAH (Single) | 0.006162 | -0.000033 | 0.231461 |
| | MMG | 0.626241 | 0.594462 | 0.763596 |
| pca20 | Kmeans | 0.795542 | 0.797493 | 0.915056 |
| | Spectral Clustering | 0.778342 | 0.794862 | 0.912809 |
| | HDBSCAN | 0.032800 | -0.001725 | 0.239101 |
| | CAH (Ward) | 0.798152 | 0.829710 | 0.927640 |
| | CAH (Complete) | 0.422425 | 0.280185 | 0.430112 |
| | CAH (Average) | 0.030833 | -0.001529 | 0.240000 |
| | CAH (Single) | 0.004089 | -0.000293 | 0.230112 |
| | MMG | 0.811845 | 0.823981 | 0.924944 |
| tsne | Kmeans | 0.846693 | 0.870771 | 0.943820 |
| | HDBSCAN | 0.433858 | 0.154393 | 0.347865 |
| | CAH (Ward) | 0.822428 | 0.842449 | 0.930787 |
| | CAH (Complete) | 0.807556 | 0.816937 | 0.919101 |
| | CAH (Average) | 0.847521 | 0.867654 | 0.943371 |
| | CAH (Single) | 0.478919 | 0.231575 | 0.447640 |
| | MMG | 0.849050 | 0.869827 | 0.944270 |
| umap | Kmeans | 0.866753 | 0.889493 | 0.953708 |
| | Spectral Clustering | 0.773931 | 0.642134 | 0.703820 |
| | HDBSCAN | 0.440565 | 0.093722 | 0.198652 |
| | CAH (Ward) | 0.854778 | 0.872792 | 0.946067 |
| | CAH (Complete) | 0.847302 | 0.861356 | 0.940674 |
| | CAH (Average) | 0.851725 | 0.869957 | 0.944719 |
| | CAH (Single) | 0.773931 | 0.642134 | 0.703820 |
| | MMG | 0.858385 | 0.878537 | 0.948764 |
| umap0 | Kmeans | 0.873915 | 0.897055 | 0.956854 |
| | Spectral Clustering | 0.811883 | 0.720567 | 0.784719 |
| | HDBSCAN | 0.506215 | 0.271883 | 0.366742 |
| | CAH (Ward) | 0.862929 | 0.882941 | 0.950562 |
| | CAH (Complete) | 0.873619 | 0.883432 | 0.951011 |
| | CAH (Average) | 0.811883 | 0.720567 | 0.784719 |
| | CAH (Single) | 0.712985 | 0.586191 | 0.608539 |
| | MMG | 0.873711 | 0.896928 | 0.956854 |
| autoencoder | Kmeans | 0.448682 | 0.357277 | 0.603146 |
| | HDBSCAN | 0.011783 | -0.000159 | 0.237303 |
| | CAH (Ward) | 0.422112 | 0.310587 | 0.521798 |
| | CAH (Complete) | 0.447198 | 0.354496 | 0.528989 |
| | CAH (Average) | 0.459243 | 0.356343 | 0.517303 |
| | CAH (Single) | 0.014261 | -0.000417 | 0.235056 |
| | MMG | 0.405727 | 0.280807 | 0.488539 |
time: 11.9 ms (started: 2021-12-05 15:33:31 +00:00)
Best methods according to the 3 metrics
metric = "ari"
results_word2vec_bbc[results_word2vec_bbc[metric] == results_word2vec_bbc[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap0 | Kmeans | 0.873915 | 0.897055 | 0.956854 |
time: 10.6 ms (started: 2021-12-05 15:33:33 +00:00)
metric = "nmi"
results_word2vec_bbc[results_word2vec_bbc[metric] == results_word2vec_bbc[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap0 | Kmeans | 0.873915 | 0.897055 | 0.956854 |
time: 7.73 ms (started: 2021-12-05 15:33:33 +00:00)
metric = "acc"
results_word2vec_bbc[results_word2vec_bbc[metric] == results_word2vec_bbc[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap0 | Kmeans | 0.873915 | 0.897055 | 0.956854 |
| | MMG | 0.873711 | 0.896928 | 0.956854 |
time: 7.93 ms (started: 2021-12-05 15:33:35 +00:00)
The best model is KMeans on a 20-component UMAP embedding.
X_reduced_best = UMAP(n_components=20, random_state=42).fit_transform(bbc_word2vec)
best_labels = KMeans(n_clusters=k_bbc, random_state=42).fit(X_reduced_best).labels_
time: 11 s (started: 2021-12-05 15:34:53 +00:00)
Word clouds of the best clustering
print_wordcloud(df_bbc['text'].values, best_labels)
Class: 1
Class: 0
Class: 3
Class: 2
Class: 4
time: 4.95 s (started: 2021-12-05 15:35:10 +00:00)
print_wordcloud(df_bbc['text'].values, df_bbc['label'].values)
Class: sport
Class: entertainment
Class: tech
Class: business
Class: politics
time: 5.07 s (started: 2021-12-05 15:35:19 +00:00)
By visualizing and comparing the word clouds, we can identify which original class each predicted cluster corresponds to.
X_reduced = PCA(n_components=2, random_state=42).fit_transform(bbc_glove)
kmeans_labels = KMeans(k_bbc, random_state=42).fit(bbc_glove).labels_
spectral_labels = SpectralClustering(k_bbc, n_components= bbc_glove.shape[1], assign_labels="discretize", random_state=42).fit(bbc_glove).labels_
hdbscan_labels = hdbscan.HDBSCAN(algorithm="best", alpha=1.0, leaf_size=40, cluster_selection_method="eom", metric="euclidean").fit(bbc_glove).labels_
cah_labels_ward = AgglomerativeClustering(n_clusters=k_bbc).fit_predict(bbc_glove)
cah_labels_complete = AgglomerativeClustering(n_clusters=k_bbc, linkage="complete").fit_predict(bbc_glove)
cah_labels_average = AgglomerativeClustering(n_clusters=k_bbc, linkage="average").fit_predict(bbc_glove)
cah_labels_single = AgglomerativeClustering(n_clusters=k_bbc, linkage="single").fit_predict(bbc_glove)
gaussian = GaussianMixture(n_components=k_bbc, random_state=42).fit_predict(bbc_glove)
time: 16.2 s (started: 2021-12-05 15:39:12 +00:00)
results_bbc_glove_redim["original"] = eval_clustering_2D(X_reduced, map_labels(bbc_labels), [kmeans_labels, spectral_labels, hdbscan_labels, cah_labels_ward, cah_labels_complete, cah_labels_average, cah_labels_single, gaussian], ["Kmeans", "Spectral Clustering", "HDBSCAN", "CAH (Ward)", "CAH (Complete)","CAH (Average)", "CAH (Single)", "MMG"])
time: 1.3 s (started: 2021-12-05 15:39:57 +00:00)
Visualizing the space with a 2-dimensional PCA, we can see that the classes are hard to separate.
The results above show that most of the clustering methods are unable to separate the classes in the original space: most of them score close to a random label assignment (ARI near or below 0). Interestingly, KMeans, CAH (Ward), and MMG do better than the other models and correctly classify 85% to 89% of the documents.
The best model is MMG, with 89% accuracy and an NMI of 0.737.
PCA with 2 components
X_reduced = PCA(n_components=2, random_state=42).fit_transform(bbc_glove)
results_bbc_glove_redim['pca2'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels))
time: 1.77 s (started: 2021-12-05 15:42:31 +00:00)
PCA with 20 components
X_reduced = PCA(n_components=20, random_state=42).fit_transform(bbc_glove)
results_bbc_glove_redim['pca20'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels))
time: 3.11 s (started: 2021-12-05 15:42:58 +00:00)
We can see that projecting with a 2-component PCA degrades the performance of the models, since the projected space does not separate the classes well.
Going from 2 to 20 components does not improve the performance of most models, except for KMeans, which recovers its original-space performance, and MMG, which improves to an accuracy of 93%.
MMG is the best model here, with an NMI of 0.81.
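One way to judge how much structure a 2- versus 20-component PCA keeps is the cumulative explained variance ratio; a short sketch on synthetic data (the data here is illustrative, not the notebook's embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))  # stand-in for the document embeddings

pca = PCA(n_components=20, random_state=42).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# cumvar[1] is the variance kept by a 2-component projection,
# cumvar[-1] by the 20-component one; a large gap between the two
# suggests 2 components discard too much structure
```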
TSNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(bbc_glove)
results_bbc_glove_redim['tsne'] = run_clustering(X_reduced_tsne, k_bbc, map_labels(bbc_labels))
time: 11.4 s (started: 2021-12-05 15:44:56 +00:00)
UMAP
X_reduced = UMAP(n_components=2, random_state=42).fit_transform(bbc_glove)
time: 6.77 s (started: 2021-12-05 15:52:00 +00:00)
results_bbc_glove_redim['umap'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 2.73 s (started: 2021-12-05 15:52:07 +00:00)
X_reduced = UMAP(n_components=20, random_state=42).fit_transform(bbc_glove)
time: 10.3 s (started: 2021-12-05 15:52:10 +00:00)
results_bbc_glove_redim['umap20'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 2.58 s (started: 2021-12-05 15:52:20 +00:00)
Autoencoder
n_components=2
X_reduced = autoencoder(bbc_glove.astype("float32"), n_components, seed=42, learning_rate=1e-3)
100%|██████████| 50/50 [00:04<00:00, 11.71it/s]
time: 4.3 s (started: 2021-12-05 15:53:52 +00:00)
results_bbc_glove_redim['autoencoder'] = run_clustering(X_reduced, k_bbc, map_labels(bbc_labels), spectral=True)
time: 2.66 s (started: 2021-12-05 15:53:56 +00:00)
from copy import deepcopy

results_temp = deepcopy(results_bbc_glove_redim)
for method in results_temp:
    for result in results_temp[method]:
        nmi, ari, acc = results_temp[method][result]
        results_temp[method][result] = {
            'nmi': nmi,
            'ari': ari,
            'acc': acc
        }
time: 1.1 ms (started: 2021-12-05 15:56:22 +00:00)
results_glove_bbc = pd.DataFrame.from_dict(
    {(i, j): results_temp[i][j]
     for i in results_temp.keys()
     for j in results_temp[i].keys()},
    orient='index')
time: 5.45 ms (started: 2021-12-05 15:56:29 +00:00)
Metrics table
results_glove_bbc
| | | nmi | ari | acc |
|---|---|---|---|---|
| original | Kmeans | 0.735476 | 0.757107 | 0.895730 |
| | Spectral Clustering | 0.417515 | 0.032330 | 0.043596 |
| | HDBSCAN | 0.046052 | 0.022061 | 0.280000 |
| | CAH (Ward) | 0.728462 | 0.668795 | 0.850787 |
| | CAH (Complete) | 0.224572 | 0.090191 | 0.349663 |
| | CAH (Average) | 0.016877 | -0.001018 | 0.235955 |
| | CAH (Single) | 0.004394 | -0.000046 | 0.230112 |
| | MMG | 0.737276 | 0.759297 | 0.896629 |
| pca2 | Kmeans | 0.407170 | 0.331899 | 0.552360 |
| | HDBSCAN | 0.014463 | 0.000336 | 0.237753 |
| | CAH (Ward) | 0.415824 | 0.337473 | 0.529888 |
| | CAH (Complete) | 0.385331 | 0.319554 | 0.549663 |
| | CAH (Average) | 0.366228 | 0.283102 | 0.462472 |
| | CAH (Single) | 0.004394 | -0.000046 | 0.230112 |
| | MMG | 0.428693 | 0.368565 | 0.652584 |
| pca20 | Kmeans | 0.736676 | 0.758055 | 0.896180 |
| | HDBSCAN | 0.376684 | 0.086681 | 0.426966 |
| | CAH (Ward) | 0.674327 | 0.646407 | 0.827865 |
| | CAH (Complete) | 0.441534 | 0.296157 | 0.509663 |
| | CAH (Average) | 0.011191 | -0.000785 | 0.232360 |
| | CAH (Single) | 0.004395 | -0.000045 | 0.230562 |
| | MMG | 0.814471 | 0.837951 | 0.931236 |
| tsne | Kmeans | 0.823546 | 0.849094 | 0.936180 |
| | HDBSCAN | 0.585269 | 0.476259 | 0.529888 |
| | CAH (Ward) | 0.827652 | 0.853111 | 0.938427 |
| | CAH (Complete) | 0.792379 | 0.793482 | 0.906966 |
| | CAH (Average) | 0.815322 | 0.834862 | 0.930337 |
| | CAH (Single) | 0.430015 | 0.211794 | 0.448539 |
| | MMG | 0.794644 | 0.817088 | 0.920899 |
| umap | Kmeans | 0.870534 | 0.897714 | 0.956854 |
| | Spectral Clustering | 0.794670 | 0.698213 | 0.775281 |
| | HDBSCAN | 0.416161 | 0.061448 | 0.182921 |
| | CAH (Ward) | 0.845860 | 0.867525 | 0.943371 |
| | CAH (Complete) | 0.823265 | 0.809267 | 0.912809 |
| | CAH (Average) | 0.794660 | 0.698506 | 0.776180 |
| | CAH (Single) | 0.665509 | 0.444819 | 0.605843 |
| | MMG | 0.869943 | 0.896554 | 0.956404 |
| umap20 | Kmeans | 0.874608 | 0.901818 | 0.958652 |
| | Spectral Clustering | 0.795120 | 0.697860 | 0.775730 |
| | HDBSCAN | 0.479458 | 0.236400 | 0.328090 |
| | CAH (Ward) | 0.868180 | 0.896627 | 0.956404 |
| | CAH (Complete) | 0.869015 | 0.892266 | 0.954607 |
| | CAH (Average) | 0.793575 | 0.697462 | 0.775730 |
| | CAH (Single) | 0.650290 | 0.450460 | 0.606292 |
| | MMG | 0.868389 | 0.892285 | 0.954607 |
| autoencoder | Kmeans | 0.671442 | 0.659545 | 0.842247 |
| | Spectral Clustering | 0.620896 | 0.528694 | 0.665618 |
| | HDBSCAN | 0.373276 | 0.130230 | 0.388764 |
| | CAH (Ward) | 0.651611 | 0.650360 | 0.839101 |
| | CAH (Complete) | 0.614533 | 0.502197 | 0.645843 |
| | CAH (Average) | 0.524004 | 0.340746 | 0.516854 |
| | CAH (Single) | 0.006485 | 0.000145 | 0.231461 |
| | MMG | 0.702056 | 0.721822 | 0.874157 |
time: 11.8 ms (started: 2021-12-05 15:56:33 +00:00)
Best methods according to the 3 metrics
metric = "ari"
results_glove_bbc[results_glove_bbc[metric] == results_glove_bbc[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap20 | Kmeans | 0.874608 | 0.901818 | 0.958652 |
time: 7.64 ms (started: 2021-12-05 15:56:41 +00:00)
metric = "nmi"
results_glove_bbc[results_glove_bbc[metric] == results_glove_bbc[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap20 | Kmeans | 0.874608 | 0.901818 | 0.958652 |
time: 7.35 ms (started: 2021-12-05 15:56:45 +00:00)
metric = "acc"
results_glove_bbc[results_glove_bbc[metric] == results_glove_bbc[metric].max()]
| | | nmi | ari | acc |
|---|---|---|---|---|
| umap20 | Kmeans | 0.874608 | 0.901818 | 0.958652 |
time: 7.06 ms (started: 2021-12-05 15:56:53 +00:00)
The best model is KMeans on a 20-component UMAP embedding.
X_reduced_best = UMAP(n_components=20, random_state=42).fit_transform(bbc_glove)
best_labels = KMeans(n_clusters=k_bbc, random_state=42).fit(X_reduced_best).labels_
time: 10.6 s (started: 2021-12-05 15:58:04 +00:00)
Word clouds of the best clustering
print_wordcloud(df_bbc['text'].values, best_labels)
Class: 2
Class: 4
Class: 0
Class: 1
Class: 3
time: 4.91 s (started: 2021-12-05 15:58:25 +00:00)
print_wordcloud(df_bbc['text'].values, df_bbc['label'].values)
Class: sport
Class: entertainment
Class: tech
Class: business
Class: politics
time: 4.9 s (started: 2021-12-05 15:58:41 +00:00)
By visualizing and comparing the word clouds, we can identify which original class each predicted cluster corresponds to.
We can see that the bbc dataset is easier to cluster because its classes are easier to separate: we reach 95% accuracy, compared with 77% on classic4.